11 research outputs found
Recommended from our members
High-dimensional Feature Selection and Multi-level Causal Mediation Analysis with Applications to Human Aging and Cluster-based Intervention Studies
Many questions in public health and medicine are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. As a result, causal inference frameworks and methodologies have gained interest as a promising tool to reliably answer scientific questions. However, the tasks of identifying and efficiently estimating causal effects from observed data still pose significant challenges under complex data generating scenarios. We focus on (1) high-dimensional settings where the number of variables is orders of magnitude higher than the number of observations; and (2) multi-level settings, where study participants are grouped into clusters and the exposure is assigned at the cluster level.
First, we propose a novel adaptation of the Super Learner algorithm for the task of feature selection in high-dimensional settings. In simulations and with real data, we demonstrate that our proposed approach improves the accuracy for identifying potential causes of a target variable by using a novel measure of variable importance, and by combining a library of feature selection algorithms.
Second, we consider the task of estimating ‘biological age’ from a set of age-dependent variables of potentially high dimensions (e.g., -omics). We propose a new method for calculating biological age that is based on an adaptation of the algorithm presented in chapter 2. Then, we develop an approach to evaluate, compare, and combine different approaches to biological age estimation with the goal of constructing age-related disease risk scores which could potentially aid in diagnosis and prognosis of age-related diseases.
Third, we turn our attention to causal mediation analysis in a multi-level setting where the exposure is assigned at the cluster level, but the mediator and outcomes are measured at the participant level. We extend the general hierarchical causal model to include mediating variables. We adapt the mediation effects that arise from the population intervention effect (PIE) via stochastic interventions on the exposure to the multi-level setting.
A Primer on Causality in Data Science
Many questions in Data Science are fundamentally causal in that our objective
is to learn the effect of some exposure, randomized or not, on an outcome
of interest. Even studies that are seemingly non-causal, such as those with the
goal of prediction or prevalence estimation, have causal elements, including
differential censoring or measurement. As a result, we, as Data Scientists,
need to consider the underlying causal mechanisms that gave rise to the data,
rather than simply the pattern or association observed in those data. In this
work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to
provide an introduction to some key concepts in causal inference. Similar to
other causal frameworks, the steps of the Roadmap include clearly stating the
scientific question, defining the causal model, translating the scientific
question into a causal parameter, assessing the assumptions needed to express
the causal parameter as a statistical estimand, implementation of statistical
estimators including parametric and semi-parametric methods, and interpretation
of our findings. We believe that using such a framework in Data Science will
help to ensure that our statistical analyses are guided by the scientific
question driving our research, while avoiding over-interpreting our results. We
focus on the effect of an exposure occurring at a single time point and
highlight the use of targeted maximum likelihood estimation (TMLE) with Super
Learner.
Comment: 26 pages (with references); 4 figures
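The step of expressing a causal parameter as a statistical estimand can be illustrated with a minimal sketch. It assumes a single discrete covariate W, a binary exposure A, and a numeric outcome Y, and computes the simple substitution (G-computation) estimate of the average treatment effect by stratifying on W. This is not the TMLE with Super Learner that the primer highlights; the function name `g_computation_ate` and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def g_computation_ate(records):
    """Stratified G-computation estimate of the average treatment effect.

    records: iterable of (w, a, y) tuples with a discrete covariate w,
    a binary exposure a (0 or 1), and a numeric outcome y.
    Estimand: sum over w of (E[Y | A=1, W=w] - E[Y | A=0, W=w]) * P(W=w).
    """
    outcome_sums = defaultdict(float)   # sum of y within each (w, a) cell
    cell_counts = defaultdict(int)      # number of records in each (w, a) cell
    stratum_counts = defaultdict(int)   # number of records in each stratum w
    n = 0
    for w, a, y in records:
        outcome_sums[(w, a)] += y
        cell_counts[(w, a)] += 1
        stratum_counts[w] += 1
        n += 1

    ate = 0.0
    for w, n_w in stratum_counts.items():
        # Positivity: both exposure levels must be observed in every stratum.
        if cell_counts[(w, 0)] == 0 or cell_counts[(w, 1)] == 0:
            raise ValueError(f"positivity violated in stratum {w!r}")
        mean_exposed = outcome_sums[(w, 1)] / cell_counts[(w, 1)]
        mean_unexposed = outcome_sums[(w, 0)] / cell_counts[(w, 0)]
        # Weight the stratum-specific contrast by the empirical P(W = w).
        ate += (mean_exposed - mean_unexposed) * n_w / n
    return ate


# Hypothetical toy data: (w, a, y)
toy_records = [
    (0, 0, 1.0), (0, 0, 1.0), (0, 1, 2.0),
    (1, 0, 0.0), (1, 1, 4.0), (1, 1, 4.0),
]
print(g_computation_ate(toy_records))
```

The estimate has a causal interpretation only under the identification assumptions the Roadmap asks the analyst to assess, such as no unmeasured confounding given W and positivity; the code checks only the latter.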
GEMINI: a computationally-efficient search engine for large gene expression datasets
Background
Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number or meta-data tag. However, in this search paradigm the form of the query – a text-based string – is mismatched with the form of the target – a genomic profile.
Results
To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an O(log n) expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that it achieves a query time that scales as the logarithm of the number of records in practice on genomic data. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute force search time of 0.6 sec.
Conclusions
GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information.
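The nearest-neighbor search over a vantage-point tree described above can be sketched as follows. This is a minimal illustration of the general technique, not GEMINI's actual implementation: it assumes profiles are numeric tuples compared with Euclidean distance, and the names `VPNode`, `build_vp_tree` and `nearest` are hypothetical.

```python
import math
import random
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    """One node: a vantage point, a radius, and inside/outside subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside    # points with distance <= radius from vantage
        self.outside = outside  # points with distance > radius from vantage


def build_vp_tree(points, dist=euclidean, rng=None):
    """Recursively build a vantage-point tree from a list of profiles."""
    if not points:
        return None
    rng = rng or random.Random(0)
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    radius = statistics.median(dists)  # split remaining points at the median
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node, query, dist=euclidean, best=None):
    """Return (distance, point) for the stored profile closest to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d <= node.radius:
        # Query lies inside the ball: search inside first, then visit the
        # outside only if the current best ball crosses the boundary.
        best = nearest(node.inside, query, dist, best)
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best
```

The pruning tests follow from the triangle inequality, so the search returns the same neighbor as a brute-force scan while often skipping whole subtrees. How much is skipped depends on the data and the distance used, which is consistent with the abstract's qualification that the O(log n) expected query time holds in certain circumstances.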
GLAD: A mixed-membership model for heterogeneous tumor subtype classification
MOTIVATION: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity between as well as within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination. RESULTS: We have developed a mixed-membership classification model, called GLAD, that simultaneously learns a sparse biomarker signature for each subtype as well as a distribution over subtypes for each sample. We demonstrate the accuracy of this model on simulated data, in-vitro mixture experiments, and clinical samples from the Cancer Genome Atlas (TCGA) project. We show that many TCGA samples are likely a mixture of multiple subtypes.
Sperm DNA methylation mediates the association of male age on reproductive outcomes among couples undergoing infertility treatment
Parental age at time of offspring conception is increasing in developed countries. Advanced male age is associated with decreased reproductive success and increased risk of adverse neurodevelopmental outcomes in offspring. Mechanisms for these male age effects remain unclear, but changes in sperm DNA methylation over time are one potential explanation. We assessed genome-wide methylation of sperm DNA from 47 semen samples collected from male participants of couples seeking infertility treatment. We report that higher male age was associated with lower likelihood of fertilization and live birth, and poor embryo development (p < 0.05). Furthermore, our multivariable linear models showed male age was associated with alterations in sperm methylation at 1698 CpGs and 1146 regions (q < 0.05), which were associated with > 750 genes enriched in embryonic development, behavior and neurodevelopment among others. High dimensional mediation analyses identified four genes (DEFB126, TPI1P3, PLCH2 and DLGAP2) with age-related sperm differential methylation that accounted for 64% (95% CI 0.42–0.86%; p < 0.05) of the effect of male age on lower fertilization rate. Our findings from this modest IVF population provide evidence for sperm methylation as a mechanism of age-induced poor reproductive outcomes and identify possible candidate genes for mediating these effects.
Additional file 1 of GEMINI: a computationally-efficient search engine for large gene expression datasets
Supplementary information. (PDF 81 KB)
DNA methylation profiles reveal sex-specific associations between gestational exposure to ambient air pollution and placenta cell-type composition in the PRISM cohort study
Background
Gestational exposure to ambient air pollution has been associated with adverse health outcomes for mothers and newborns. The placenta is a central regulator of the in utero environment that orchestrates development and postnatal life via fetal programming. Ambient air pollution contaminants can reach the placenta and have been shown to alter bulk placental tissue DNA methylation patterns. Yet the effect of air pollution on placental cell-type composition has not been examined. We aimed to investigate whether the exposure to ambient air pollution during gestation is associated with placental cell types inferred from DNA methylation profiles.
Methods
We leveraged data from 226 mother–infant pairs in the Programming of Intergenerational Stress Mechanisms (PRISM) longitudinal cohort in the Northeastern US. Daily concentrations of fine particulate matter (PM2.5) at 1 km spatial resolution were estimated from a spatiotemporal model developed with satellite data and linked to women's addresses during pregnancy and infants' date of birth. The proportions of six cell types [syncytiotrophoblasts, trophoblasts, stromal, endothelial, Hofbauer and nucleated red blood cells (nRBCs)] were derived from placental tissue 450K DNA methylation array. We applied compositional regression to examine overall changes in placenta cell-type composition related to PM2.5 average by pregnancy trimester. We also investigated the association between PM2.5 and individual cell types using beta regression. All analyses were performed in the overall sample and stratified by infant sex adjusted for covariates.
Results
In male infants, first trimester (T1) PM2.5 was associated with changes in placental cell composition (p = 0.03), driven by a decrease [per one PM2.5 interquartile range (IQR)] of 0.037 in the syncytiotrophoblasts proportion (95% confidence interval (CI) [−0.066, −0.012]), accompanied by an increase in trophoblasts of 0.033 (95% CI: [0.009, 0.064]). In females, second and third trimester PM2.5 were associated with overall changes in placental cell-type composition (T2: p = 0.040; T3: p = 0.049), with a decrease in the nRBC proportion. Individual cell-type analysis with beta regression showed similar results with an additional association found for third trimester PM2.5 and stromal cells in females (decrease of 0.054, p = 0.024).
Conclusion
Gestational exposure to air pollution was associated with placenta cell composition. Further research is needed to corroborate these findings and evaluate their role in PM2.5-related impact in the placenta and consequent fetal programming.
Sperm epigenetic clock associates with pregnancy outcomes in the general population.
Study question
Is sperm epigenetic aging (SEA) associated with probability of pregnancy among couples in the general population?
Summary answer
We observed a 17% lower cumulative probability at 12 months for couples with male partners in the older compared to the younger SEA categories.
What is known already
The strong relation between chronological age and DNA methylation profiles has enabled the estimation of biological age as epigenetic 'clock' metrics in most somatic tissue. Such clocks in male germ cells are less developed and lack clinical relevance in terms of their utility to predict reproductive outcomes.
Study design, size, duration
This was a population-based prospective cohort study of couples discontinuing contraception to become pregnant recruited from 16 US counties from 2005 to 2009 and followed for up to 12 months.
Participants/materials, setting, methods
Sperm DNA methylation from 379 semen samples was assessed via a beadchip array. A state-of-the-art ensemble machine learning algorithm was employed to predict age from the sperm DNA methylation data. SEA was estimated from clocks derived from individual CpGs (SEACpG) and differentially methylated regions (SEADMR). Probability of pregnancy within 1 year was compared by SEA, and discrete-time proportional hazards models were used to evaluate the relations with time-to-pregnancy (TTP) with adjustment for covariates.
Main results and the role of chance
Our SEACpG clock had the highest predictive performance with correlation between chronological and predicted age (r = 0.91). In adjusted discrete Cox models, SEACpG was negatively associated with TTP (fecundability odds ratios (FORs) = 0.83; 95% CI: 0.76, 0.90; P = 1.2×10^-5), indicating a longer TTP with advanced SEACpG. For subsequent birth outcomes, advanced SEACpG was associated with shorter gestational age (n = 192; -2.13 days; 95% CI: -3.67, -0.59; P = 0.007). Current smokers also displayed advanced SEACpG (P < 0.05). Finally, SEACpG showed a strong performance in an independent IVF cohort (n = 173; r = 0.83). SEADMR performance was comparable to SEACpG but with attenuated effect sizes.
Limitations, reasons for caution
This prospective cohort study consisted primarily of Caucasian men and women, and thus analysis of large diverse cohorts is necessary to confirm the associations between SEA and couple pregnancy success in other races/ethnicities.
Wider implications of the findings
These data suggest that our sperm epigenetic clocks may have utility as a novel biomarker to predict TTP among couples in the general population and underscore the importance of the male partner for reproductive success.
Study funding/competing interest(s)
This work was funded in part by grants from the National Institute of Environmental Health Sciences, National Institutes of Health (R01 ES028298; PI: J.R.P. and P30 ES020957); Robert J. Sokol, MD Endowed Chair of Molecular Obstetrics and Gynecology (J.R.P.); and the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland (Contracts N01-HD-3-3355, N01-HD-3-3356 and N01-HD-3-3358). S.L.M. was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health. The authors declare no competing interests.
Trial registration number
N/A
Population-level viral suppression among pregnant and postpartum women in a universal test and treat trial.
Objective(s)
We sought to determine whether universal 'test and treat' (UTT) can achieve gains in viral suppression beyond universal antiretroviral treatment (ART) eligibility during pregnancy and postpartum, among women living with HIV.
Design
A community cluster randomized trial.
Methods
The SEARCH UTT trial compared an intervention of annual population testing and universal ART with a control of baseline population testing with ART by country standard, including ART eligibility for all pregnant/postpartum women, in 32 communities in Kenya and Uganda. When testing, women were asked about current pregnancy and live births over the prior year and, if HIV-infected, had their viral load measured. Between arms, we compared population-level viral suppression (HIV RNA <500 copies/ml) among all pregnant/postpartum HIV-infected women at study close (year 3). We also compared year-3 population-level viral suppression and predictors of viral suppression among all 15 to 45-year-old women by arm.
Results
At baseline, 92 and 93% of 15 to 45-year-old women tested for HIV: HIV prevalence was 12.6 and 12.3%, in intervention and control communities, respectively. Among HIV-infected women self-reporting pregnancy/live birth, prevalence of viral suppression was 42 and 44% at baseline, and 81 and 76% (P = 0.02) at year 3, respectively. Among all 15 to 45-year-old HIV-infected women, year-3 population-level viral suppression was higher in intervention (77%) versus control (68%; P < 0.001). Pregnancy/live birth was a predictor of year-3 viral suppression in control (P = 0.016) but not intervention (P = 0.43). Younger age was a risk factor for nonsuppression in both arms.
Conclusion
The SEARCH intervention resulted in higher population viral suppression among pregnant/postpartum women than a control of baseline universal testing with ART eligibility for pregnant/postpartum women.